DBRX Model Support #1462
@winglian https://huggingface.co/LnL-AI/dbrx-base-converted-v2 is up, with potentially better quantization compatibility due to the split q, k, v layers. If possible, can you validate its inference quality against the original? I ran out of GPUs to sanity-check base inference since all of mine are maxed out testing training on it.
Could we also add the above to the readme support matrix for easy viewing?
I wonder if this is due to bnb.
I will test the 8bit-lora and 16-bit lora.
I am testing examples/dbrx/8bit-lora.yaml and will update when I get the results.
return dist.is_available() and dist.is_initialized()
global distributed_state # pylint: disable=global-statement
if not distributed_state:
    timeout = int(os.environ.get("AXOLOTL_NCCL_TIMEOUT", 1800))
It may be good to document this env variable.
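One way to document the variable is to show how it is read. This is a minimal sketch, assuming the value is a number of seconds (the 1800 default comes from the diff above); the helper name nccl_timeout is hypothetical and not part of axolotl.

```python
import os
from datetime import timedelta


def nccl_timeout(default_seconds: int = 1800) -> timedelta:
    """Read AXOLOTL_NCCL_TIMEOUT from the environment.

    Assumption: the value is expressed in seconds. Falls back to
    default_seconds when the variable is unset.
    """
    raw = os.environ.get("AXOLOTL_NCCL_TIMEOUT", default_seconds)
    return timedelta(seconds=int(raw))
```

Under that assumption, setting AXOLOTL_NCCL_TIMEOUT=3600 before launching would raise the timeout to one hour.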
def n_loading_workers(quant_method: str, param_count: float):
    devprops = torch.cuda.get_device_properties(torch.cuda.current_device())
    left = int(os.cpu_count() / torch.cuda.device_count())
    model_params_b = 70
Should this be moved to a function param instead of being hardcoded?
Yeah. Not sure what to do about this atm. Even the answer.ai fsdp-qlora example has this hardcoded. Not sure there is a good way to get the number of parameters in the model before we actually load the model.
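One possible workaround, sketched here as an assumption rather than something this PR does: sharded Hugging Face checkpoints ship an index file (model.safetensors.index.json) whose metadata.total_size field records the total tensor bytes, so a rough parameter count can be derived without loading any weights. The function name estimate_params_b is hypothetical, and the estimate only holds if all tensors share one dtype.

```python
import json

# Bytes per parameter for a few common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_params_b(index_json: str, dtype: str = "bfloat16") -> float:
    """Estimate the parameter count in billions from a checkpoint index.

    index_json is the text of a sharded-checkpoint index file.
    Assumption: every tensor is stored in the given dtype.
    """
    index = json.loads(index_json)
    total_bytes = index["metadata"]["total_size"]
    return total_bytes / BYTES_PER_PARAM[dtype] / 1e9
```

For a made-up index reporting 140,000,000,000 bytes in bfloat16, this yields 70 billion parameters, which could then be passed in instead of the hardcoded value.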
elif cfg.adapter == "lora" and cfg.load_in_8bit:
    bnb_config = {
        "load_in_8bit": True,
    }
Previously, we did not have this. What is the effect of this change?
There is a warning currently in transformers that passing load_in_8bit will be deprecated soon and that we should use the quantization config instead.
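The change can be sketched as follows. The helper bnb_config_kwargs is a hypothetical name that mirrors the branch in the diff above; it only builds the keyword dict, and the commented lines show how that dict would plausibly be wrapped in a transformers BitsAndBytesConfig rather than passing load_in_8bit to from_pretrained directly.

```python
from typing import Any, Dict, Optional


def bnb_config_kwargs(adapter: Optional[str], load_in_8bit: bool) -> Dict[str, Any]:
    """Build quantization-config kwargs for the 8-bit LoRA case only."""
    if adapter == "lora" and load_in_8bit:
        return {"load_in_8bit": True}
    return {}


# With transformers installed, the dict would be used roughly like this
# (model_name is a placeholder):
#   from transformers import AutoModelForCausalLM, BitsAndBytesConfig
#   quantization_config = BitsAndBytesConfig(**bnb_config_kwargs("lora", True))
#   model = AutoModelForCausalLM.from_pretrained(
#       model_name, quantization_config=quantization_config
#   )
```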
Using offload_param for DeepSpeed ZeRO-3 (from #1466), VRAM utilization is ~50GB per GPU at batch size 1.
@winglian Please test using tokenizer from https://huggingface.co/LnL-AI/dbrx-base-tokenizer and not the tiktoken one from dbrx which has several problems. It resolves 3 issues I found which negatively affect training:
My tokenizer is based on the one created by HF staff Xenova (https://huggingface.co/Xenova/dbrx-instruct-tokenizer). I am trying to validate this with him to see if the tokenizer is 100% encode/decode compatible: https://huggingface.co/Xenova/dbrx-instruct-tokenizer/discussions/1. We are also going to do some internal testing on this. My changes:
# pad token
"100256": {
"content": "<|pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
# 15 unused/reserved extra tokens
"<|extra_0|>": 100261
"<|extra_1|>": 100262
...
"<|extra_14|>": 100275
EDIT: removed wrong attribution that
@Qubitium, hey, was wondering whether that new tokenizer vocab has the same size as the model embed size or whether that needs to be resized as well?
It's the same size. I did not add any new tokens beyond the original embed size.
@NanoCode012 Correction. If you believe the embed size
EDIT: I am unsure about the resize too now.
EDIT: removed wrong attribution that len(tokenizer) != tokenizer.vocab_size in original tokenizer. So the issue is just that the extra tokens are never exposed to the encoder and pad token == eos.
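The two questions in this exchange (does the embedding need a resize, and does pad collide with eos) can be checked mechanically. This is a minimal sketch with hypothetical helper names; embed_rows stands for the number of rows in the model's embedding matrix, which is not stated in this thread and has to be read from the model config.

```python
from typing import Dict


def needs_resize(added_tokens: Dict[str, int], embed_rows: int) -> bool:
    """Return True if any token id falls outside the embedding matrix."""
    return max(added_tokens.values()) >= embed_rows


def pad_is_eos(pad_token_id: int, eos_token_id: int) -> bool:
    """Return True if padding and end-of-sequence share one id."""
    return pad_token_id == eos_token_id


# Token ids taken from the tokenizer changes listed above.
ADDED = {"<|pad|>": 100256, "<|extra_0|>": 100261, "<|extra_14|>": 100275}
```

If needs_resize(ADDED, embed_rows) is False for the real embedding size, the new tokens fit in the existing matrix and no resize is required.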
Is there anything we need to review from here? I'm hoping to get this merged today if possible, even if it's only preliminary.
@winglian, I think this can be merged first. The only consequence of the above comment, I believe, is that for adapter training the embed_len and lm_head need to be targeted due to the resize.
When running FSDP with meta-llama/Meta-Llama-3.1-70B-Instruct, I get an error. I tested with meta-llama/Meta-Llama-3-70B-Instruct to see whether it trains normally, and it does. How can I resolve this error? Please give me a hint! I'm sharing the error message, config, and training code for reference.
* wip for dbrx finetuning
* add fastcore for parallel loading of sharded weights
* fix dtype for load, use PartialState instead of accelerator to init process group, remove redundant wandb callback
* update to use v2 of the converted model
* more fixes for dbrx loras
* make sure to enable fsdp activation checkpointing
* fix support for 8bit loras too for dbrx
* apply z3 leaf moe fix for DBRX with deepspeed
* don't raise value error since child module searches could fail and be ok
* revert a previous change to fix fsdp
* update mistral/mistral qlora+fsdp yamls
* fix qlora+fsdp quant storage type
* more edge cases for qlora-fsdp
* fixes for fsdp+qlora w optimizer in 8bit
* add bigstral z3 config and make sure to use full_state_dict for fsdp
DBRX MoE
Currently, for LoRA, only the q_proj, k_proj, v_proj, out_proj and layer Linear layers are trainable.
We are using the "converted" base models based on this issue where the Experts are fused as an nn.Parameter rather than an nn.Linear layer. However, the implementation is still a bit buggy and attempting to train a LoRA adapter over those w1, w2 and v1 layers results in the trainer hanging.
FSDP
We've tested using the LnL-AI/dbrx-base-converted-v2 model as the base model for FSDP.
The high memory usage seen w/ FSDP is due to FSDP not supporting 8bit optimizers.
paged_adamw_8bit optimizer errors from being on cpu
Error an illegal memory access was encountered at line 90 in file /src/csrc/ops.cu
Deepspeed
WIP